knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)
For our final project, we located a collection of data sets containing information about the current coronavirus pandemic. The data can be found on kaggle.com under the headline “Uncover Covid-19 Challenge”. This collection contains dozens of data sets covering topics such as Covid-19 cases, airport closings, mobility, ICU admissions, blood work of patients, and more. We do not use most of these files, but have chosen a few of them to conduct our analysis on; they are introduced throughout the project as we use them. We use various statistical models, first to examine which types of counties are more susceptible to the virus at large, and then to examine which types of patients are more at risk individually. As we go through the process, we keep a few research questions in mind:

- What health and social vulnerability characteristics might lead to higher infection rates in different counties?
- How can we use kernel smoothing and splines to approximate the rates of infection, and what is the value of each approach?
- What demographics of patients are more likely to be diagnosed with COVID-19? To be admitted to the ICU?
- What blood characteristics are indicative of a patient with COVID-19? Of a patient admitted to the ICU?
One of the most basic, yet useful, statistical models is ordinary least squares linear regression. Using county-level data on Covid-19 cases and various demographic qualities of counties, we attempt to predict case rates and death rates with several variables of interest.
We start by loading packages we will need:
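The loading chunk itself is not reproduced in this extract. Judging from the functions called below (read_csv, str_detect, the ggplot2 calls, dendextend's dendrogram helpers, and bs() from splines), the list was presumably something like this sketch:

```r
#packages inferred from the code that follows; this list is an
#assumption, not the original chunk
library(tidyverse)   # read_csv, dplyr verbs, ggplot2, stringr
library(dendextend)  # place_labels, color_branches, color_unique_labels
library(splines)     # bs() for regression splines
```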
Now we load three datasets for county-level data. The first contains information on Covid-19 cases and deaths by county, the next contains health data by county, and the last contains social vulnerability data by county.
county_cases <- read_csv('./US_counties/county_case_data.csv')
county_health_data <- read_csv('./US_counties/county_health_rankings.csv')
county_svi <- read_csv('./US_counties/social_vulnerability_index.csv')
Now we clean the data and mutate two new columns: a case rate (cases per capita) and a mortality rate (deaths per case).
#cleaning health data: drop the race-specific columns and the 95% confidence-interval columns
drop_health <- str_detect(colnames(county_health_data),
                          'aian|black|hispanic|white|asian|95percent')
#the original loop's sentinel value also dropped column 9; we keep that
#behavior so downstream numeric column indices are unchanged
county_health_data_clean <- county_health_data[, -unique(c(which(drop_health), 9))]
#health stats and county case rates data set
health_and_cases <- left_join(county_cases,
county_health_data_clean,
by = c('fips'))
#cleaning social vulnerability data: drop the first four columns and the
#margin-of-error (M_ / MP_) columns
county_svi_cut <- county_svi[,-c(1:4)]
drop_svi <- str_detect(colnames(county_svi_cut), 'M_|MP_')
#the original loop's sentinel value also dropped column 5; we keep that
#behavior so downstream numeric column indices are unchanged
county_svi_subset <- county_svi_cut[, -unique(c(which(drop_svi), 5))]
colnames(county_svi_subset)[1] = 'fips'
county_svi_subset$fips <- as.numeric(county_svi_subset$fips)
#overall county data set: case rates, health statistics, and social vulnerability metrics
county_overall <- left_join(health_and_cases,
county_svi_subset,
by = c('fips'))
county_overall_subset <- county_overall %>%
filter(date == '2020-04-27') %>%
filter(!is.na(fips))
county_overall_subset <- county_overall_subset %>%
mutate(covid_caserate = cases/population) %>%
mutate(covid_deathrate = deaths/cases)
Now we are ready to run some linear regression. Let’s look at a graph of mortality rate versus percent physically inactive. The hypothesis is that the more physically inactive a county is, the higher its death rate will be, as reporting has linked physical fitness to a reduced risk from Covid-19.
county_overall_subset %>%
ggplot(aes(x = percent_physically_inactive, y = covid_deathrate)) +
geom_point() +
ggtitle("Covid-19 Mortality Rates in Counties by Physical Activity") +
xlab("Percent Physically Inactive") +
ylab("Covid-19 Mortality Rate") +
theme_bw() +
theme(plot.title = element_text(hjust = 0.5, size = 14)) +
theme(axis.text.x = element_text(color="black", size = 10)) +
theme(axis.text.y = element_text(color="black", size = 10)) +
theme(axis.title.x = element_text(color="black", size = 11)) +
theme(axis.title.y = element_text(color="black", size = 11)) +
theme(legend.position="none") +
theme(panel.grid.minor = element_blank())
It is quite hard to tell from this graph, given the large number of data points and their tendency to cluster toward lower values. Let’s try zooming in to death rates between 0 and 0.25.
county_overall_subset %>%
filter(covid_deathrate < 0.25) %>%
ggplot(aes(x = percent_physically_inactive, y = covid_deathrate)) +
geom_point() +
ggtitle("Covid-19 Mortality Rates in Counties by Physical Activity") +
xlab("Percent Physically Inactive") +
ylab("Covid-19 Mortality Rate") +
theme_bw() +
theme(plot.title = element_text(hjust = 0.5, size = 14)) +
theme(axis.text.x = element_text(color="black", size = 10)) +
theme(axis.text.y = element_text(color="black", size = 10)) +
theme(axis.title.x = element_text(color="black", size = 11)) +
theme(axis.title.y = element_text(color="black", size = 11)) +
theme(legend.position="none") +
theme(panel.grid.minor = element_blank())
There still does not appear to be a strong correlation, although it is often hard to tell visually. We can use linear regression to test whether the relationship is statistically significantly different from 0.
lm.model.1 <- lm(covid_deathrate ~ percent_physically_inactive,
data = county_overall_subset)
summary(lm.model.1)
##
## Call:
## lm(formula = covid_deathrate ~ percent_physically_inactive, data = county_overall_subset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.03662 -0.03638 -0.03622 0.01356 0.96370
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.608e-02 6.918e-03 5.215 1.97e-07 ***
## percent_physically_inactive 1.094e-05 2.474e-04 0.044 0.965
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0763 on 2799 degrees of freedom
## Multiple R-squared: 6.99e-07, Adjusted R-squared: -0.0003566
## F-statistic: 0.001956 on 1 and 2799 DF, p-value: 0.9647
Looking at the summary output, the coefficient on percent physically inactive is indeed positive, which is consistent with our intuition. However, the attached p-value is high, so we cannot conclude that the observed relationship is statistically significant. The true coefficient could actually be 0, with our observed estimate due simply to sampling variance.
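To see what "due to sampling variance" means concretely, here is a small simulation with made-up data (not our county data): the true slope is exactly 0 by construction, yet lm() returns a nonzero estimate, and the p-value measures how often chance alone would produce an estimate that large.

```r
set.seed(42)
x <- rnorm(200)                     # simulated predictor
y <- 0.03 + rnorm(200, sd = 0.07)   # outcome generated with a TRUE slope of 0
sim_fit <- lm(y ~ x)
coef(sim_fit)["x"]                               # nonzero purely by chance
summary(sim_fit)$coefficients["x", "Pr(>|t|)"]   # probability of such an estimate under the null
```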
One of the advantages of linear regression is that it can control for other variables by adding them to the model. Physical inactivity may turn out to be significant once we control for things like how rural the county is, how much poverty there is in the county, and so on. We will run another regression model, this time with more variables.
lm.model.2 <- lm(covid_deathrate ~ percent_physically_inactive
+ percent_rural + percent_smokers
+ percent_adults_with_obesity
+ primary_care_physicians_rate
+ percent_fair_or_poor_health + EP_AGE65
+ EP_CROWD + E_GROUPQ + EP_POV
+ SPL_THEMES + RPL_THEMES,
data = county_overall_subset)
summary(lm.model.2)
##
## Call:
## lm(formula = covid_deathrate ~ percent_physically_inactive +
## percent_rural + percent_smokers + percent_adults_with_obesity +
## primary_care_physicians_rate + percent_fair_or_poor_health +
## EP_AGE65 + EP_CROWD + E_GROUPQ + EP_POV + SPL_THEMES + RPL_THEMES,
## data = county_overall_subset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.06206 -0.03579 -0.02371 0.01325 0.96554
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.642e-02 1.546e-02 1.709 0.0875 .
## percent_physically_inactive 3.685e-04 3.724e-04 0.990 0.3225
## percent_rural -1.626e-04 6.756e-05 -2.406 0.0162 *
## percent_smokers 5.532e-04 7.068e-04 0.783 0.4339
## percent_adults_with_obesity -4.735e-04 3.471e-04 -1.364 0.1726
## primary_care_physicians_rate 1.291e-05 5.184e-05 0.249 0.8033
## percent_fair_or_poor_health -1.061e-03 7.281e-04 -1.458 0.1450
## EP_AGE65 1.023e-03 4.075e-04 2.509 0.0122 *
## EP_CROWD -8.685e-04 9.231e-04 -0.941 0.3469
## E_GROUPQ 1.000e-07 2.418e-07 0.414 0.6792
## EP_POV 7.178e-04 4.949e-04 1.450 0.1471
## SPL_THEMES 8.207e-04 1.786e-03 0.460 0.6459
## RPL_THEMES -1.526e-03 1.633e-03 -0.934 0.3502
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.07435 on 2680 degrees of freedom
## (108 observations deleted due to missingness)
## Multiple R-squared: 0.008107, Adjusted R-squared: 0.003666
## F-statistic: 1.825 on 12 and 2680 DF, p-value: 0.03917
In adding many variables, we control for a wide range of county characteristics. It turns out that physical inactivity is still not statistically significant. However, two variables that do show a significant relationship to the Covid-19 mortality rate are percent rural and the estimated percentage of residents over age 65. The second is intuitive: one of the main risk factors for dying of the virus is age. The first, percent rural, makes sense for case rates, as the more rural a county is, the less quickly Covid-19 is likely to spread there. But here we see that more rural counties, on average, also have lower mortality rates, even after controlling for the health of the population and the poverty rate, among other variables. This could be because rural counties experience lower case rates and are therefore able to treat the cases they do have with more attention.
We can compare the effectiveness of the two models using a quantitative metric. Since linear regression is not a classification technique, metrics like accuracy, precision, and recall do not apply. However, mean squared error (MSE) is a good default: it is the average squared distance between the actual data points and the values the model predicts.
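As a toy illustration with made-up numbers (not our county data):

```r
actual    <- c(1, 2, 3, 4)
predicted <- c(1.5, 2, 2.5, 4.5)
#squared errors are 0.25, 0, 0.25, 0.25, so their mean is 0.1875
mse <- mean((actual - predicted)^2)
mse  # 0.1875
```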
mean(lm.model.1$residuals^2)
## [1] 0.005818176
mean(lm.model.2$residuals^2)
## [1] 0.005501269
Not surprisingly, the MSE of the second model, which gives the model more information to predict with, is lower. However, not by much. This is because both models fit poorly, in the sense that the estimated relationships are mostly not significant.
Why is it that these models are performing quite poorly? One possible explanation is that they do not control for testing. The data we have on case rates and mortality rates inherently depends on how many tests are being given. Currently, there is a large issue of lack of tests in the U.S., which may inherently hide the actual relationship between case rates and mortality rates and the variables used in the models above. We will look at this later using New York testing data.
Another explanation could be that there is an element of arbitrariness in how Covid-19 has spread. Given how quickly the situation is evolving, mortality rates in different counties could be subject to the somewhat unpredictable matter of where the virus spread first. Over time, the predictive power of certain variables may emerge, but for now it could be masked by the fact that the virus is spreading quickly and happened to hit certain areas first. (Literature on other viruses lends credence to this hypothesis, such as work on the flu, which has had many years to “stabilize” across the country; see, for example, https://psmag.com/social-justice/the-flu-hits-harder-in-poorer-neighborhoods.)
Other models that can capture relationships between two quantitative variables include regression splines and smoothing splines. These techniques handle non-linear relationships, which often invalidate linear regression. However, based on the graphs above, the data looks more like scatter than like any systematic relationship, so splines are unlikely to help here.
Classification models, such as KNN, LDA, and random forests, are not easily used to predict quantitative outcomes. We could convert one of our variables of interest in this case to a categorical variable. For example, a binary indicator of whether the mortality rate is over, say, 3 percent. However, this loses a lot of information about the magnitude of the outcome variable. We will use classification models later on data more suited for them.
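A sketch of that conversion, using an illustrative 3-percent cutoff (the threshold and rates here are invented for illustration, not taken from our data):

```r
rates <- c(0.01, 0.05, 0.02, 0.08)           # hypothetical mortality rates
high_mortality <- ifelse(rates > 0.03, 1, 0)
high_mortality  # 0 1 0 1
```

Note that 0.05 and 0.08 both map to 1, which is exactly the loss of magnitude information described above.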
It might make sense to use some unsupervised learning to see if we can uncover any other unexpected relationships in our data set that could explain the results we are seeing. To start, we’d like to explore the relationship between the number of tests and the rate of infection reported. We use a dataset that includes county-level testing information for the state of New York.
ny_tests <- read_csv('./US_counties/ny_testing.csv')
We clean and join the data, then filter our overall county data set to counties in New York. We decide to look at a single date, April 26, 2020. Our rationale is that Covid had been detected in every New York county by this time, and the rate of infection had been growing in each county up until this point. By isolating a single day, we may be better able to see how different county health and/or social vulnerability predictors affected the growth of covid in any given New York county. By April 26, the virus had had the same amount of time to affect each county, yet we will see that different counties have very different levels of infection.
ny_overall <- county_overall %>%
filter(state.x == 'New York')
ny_overall_testing <- left_join(ny_overall,
ny_tests,
by = c('county.x' = 'county',
'date' = 'test_date'))
ny_overall_testing <- ny_overall_testing %>%
filter(county.x != 'New York City') %>%
mutate(case_pop_proportion = cases / E_TOTPOP) %>%
mutate(death_pop_proportion = deaths / E_TOTPOP) %>%
mutate(pop_density = E_TOTPOP / AREA_SQMI)
ny_overall_testing2 <- ny_overall_testing %>%
filter(date == '2020-04-26')
We also mutated a few columns to help create some new linear models in the future. We have added a population density column, a cases-to-population ratio response variable, and a deaths-to-population ratio response variable. It’s time to see if unsupervised learning can help us create a better model. First we’ll try hierarchical clustering. We subset the data by only including columns that seem relevant to predicting infection rate. We then scale these metrics, and then create a distance matrix with the results. Finally, using the distance matrix we can create a hierarchical clustering model for the data.
ny_numbers <- ny_overall_testing2[,c(12,14,16,21,23,27,31,43,45,46,53,54,61,64,76,77,97,99,100,158, 167, 170)]
ny_numbers_scaled <- scale(ny_numbers)
ny_numbers_dist <- dist(ny_numbers_scaled)
hc1 <- hclust(ny_numbers_dist)
Using the library dendextend we can visualize a dendrogram that displays the clustering. We create two visualizations - comparing them might help us find predictors. Our first clustering uses the county names as labels; our second uses the number of cases reported on April 26 as the label.
hc1 %>%
as.dendrogram() %>%
place_labels(ny_overall_testing2$county.x) %>%
set('labels_cex', .3) %>%
color_branches(k = 3) %>%
color_unique_labels() %>%
plot()
hc1 %>%
as.dendrogram() %>%
place_labels(ny_overall_testing2$cases) %>%
set('labels_cex', .3) %>%
color_branches(k = 3) %>%
color_unique_labels() %>%
plot()
Interestingly, some of our highest case counts all occur in the same red cluster on the left of the dendrogram. The dendrogram indicates that these 4 counties - Rockland, Westchester, Nassau, and Suffolk - are significantly different from the rest of the counties in New York across the predictor variables we chose. Their branch occurs at a height of around 15, the highest y-value on the dendrogram. Additionally, these 4 counties seem to have the most cases of covid. We summarize across the statistics we chose to see how the predictors vary in these 4 counties, this time including county name and case numbers to aid interpretation.
ny_hclust_summary <- ny_overall_testing2[,c(2,5,12,14,16,21,23,27,31,43,45,46,53,54,61,64,76,77,97,99,100,158,167,170)]
ny_hclust_summary %>%
mutate(cluster_test = ifelse(county.x %in% c('Rockland','Westchester', 'Suffolk', 'Nassau'), 1, 0)) %>%
group_by(cluster_test) %>%
summarize_all(mean)
## # A tibble: 2 x 25
## cluster_test county.x cases percent_fair_or… average_number_…
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0 NA 459. 14.8 3.61
## 2 1 NA 26375. 14.3 3.18
## # … with 20 more variables:
## # average_number_of_mentally_unhealthy_days <dbl>,
## # percent_smokers <dbl>, percent_adults_with_obesity <dbl>,
## # percent_physically_inactive <dbl>, percent_excessive_drinking <dbl>,
## # percent_uninsured <dbl>, num_primary_care_physicians <dbl>,
## # primary_care_physicians_rate <dbl>, num_mental_health_providers <dbl>,
## # mental_health_provider_rate <dbl>, percent_vaccinated <dbl>,
## # high_school_graduation_rate <dbl>, `80th_percentile_income` <dbl>,
## # `20th_percentile_income` <dbl>, percent_severe_housing_problems <dbl>,
## # overcrowding <dbl>, inadequate_facilities <dbl>,
## # percent_65_and_over <dbl>, percent_rural <dbl>, E_TOTPOP <dbl>
These results are very interesting. Across many predictors that we would assume to have a positive correlation with the rate of infection, our data shows otherwise. In the four counties listed above, the average number of infections is 26,375 compared to 459 in the rest of New York. However, fewer people in those counties are of fair or poor health; they have fewer physically unhealthy days and fewer mentally unhealthy days; a lower percentage are smokers; a lower percentage are obese; a lower percentage are physically inactive; a smaller percentage drink excessively; more people are insured; there are more primary care physicians and mental health providers; more people are vaccinated; the high school graduation rate is higher; income is higher; and a smaller percentage of the population is 65 and older. It seems that all of our common-sense understanding about what contributes to a person’s health is turned on its head! However, the last few columns hint at why the large difference in case counts exists. The four counties in question are 1.7 percent rural, compared to an average of 51.7 percent in the other counties. Additionally, their populations are an order of magnitude greater than the other counties’ on average. They also have a significantly higher percentage of severe housing problems, and more overcrowding per household.
All of this information helps to make a good argument for social distancing protocols. It really does seem that the proximity of people alone is a very good indicator for how many people end up getting infected. We decide to see if clustering by kmeans could lead to similar insights. Because there isn’t any inherent number of clusters, we test a range of values. Our results from hierarchical clustering indicate that 3 clusters might be a good number.
#elbow heuristic: average within-cluster sum of squares for k = 1 to 10
tot <- NULL
for(i in 1:10){
  km <- kmeans(ny_numbers_scaled, i)
  tot[i] <- km$tot.withinss/i
}
plot(tot)
We see that 4 clusters is a reasonable choice: at 4 clusters the within-cluster sum of squares is already low, and adding more clusters only minimally reduces it. Let’s try kmeans with 4 clusters. We bind the cluster assignments to the New York data to see how the clusters vary across the predictors.
km1 <- kmeans(ny_numbers_scaled,
4)
ny_clusters <- cbind(ny_overall_testing2, km1$cluster)
ny_clusters %>%
group_by(km1$cluster) %>%
summarize(average_cases = mean(cases),
average_deaths = mean(deaths),
average_pop = mean(E_TOTPOP))
## # A tibble: 4 x 4
## `km1$cluster` average_cases average_deaths average_pop
## <int> <dbl> <dbl> <dbl>
## 1 1 294. 16.7 142234.
## 2 2 109. 1.95 66379.
## 3 3 26375. 1120 1034242.
## 4 4 1094. 34.1 215246.
We see that cluster 3 has significantly more cases than any of the other clusters - it also has by far the largest average population. Notably, the cluster with the second-highest average number of cases also has the second-highest average population. The names of the 4 counties in cluster 3? The same 4 that we saw with the highest case counts in our hierarchical clustering.
ny_clusters %>%
as.data.frame() %>%
ggplot(aes(x = E_TOTPOP,
y = cases)) +
geom_bar(aes(color = factor(km1$cluster)),
stat = 'identity') +
xlab('Estimated Total Population') +
ylab('Number of Cases, April 26, 2020') +
ggtitle('Cases as a Function of Total Population')
Finally, we try PCA. We select the predictors that hierarchical clustering, kmeans clustering, and our initial linear models tell us are significant: percent severe housing problems, overcrowding (which indicates overcrowded living spaces), percent age 65 and older, percent rural, and estimated total population.
ny_pca_numbers <- ny_overall_testing2[,c(5,97,99,158,167,170)]
row.names(ny_pca_numbers) <- ny_overall_testing2$county.x
#ok let's try to reduce our variables using PCA
pca1 <- prcomp(ny_pca_numbers,
scale. = TRUE)
library(pls)
#biplot(pca1, xlim=c(0, 0.5), ylim=c(-0.2, 0.4))
biplot(pca1)
Our resulting biplot is crowded and difficult to read, but we do see the trends present in our kmeans and hierarchical clustering. Rockland, Westchester, Nassau, and Suffolk counties are all similarly displaced along the estimated total population axis, which seems strongly correlated with the cases axis. Severe housing problems also seem correlated with the number of cases - Rockland and Westchester appear to have the highest percentages of severe housing problems. Overcrowding is also slightly correlated with the number of cases, and Rockland seems to be the county with the greatest degree of overcrowding.
It’s also important to note that the percent of people age 65 and older and the percent rural axes appear anti-correlated with the number of cases. If we were looking at deaths instead of cases, we might see a correlation between the percent of people age 65 and older and the number of deaths, because covid, like the flu, appears to have more adverse effects on those with weakened immune systems and other preexisting conditions. As we might expect, rural areas of New York seem to have lower rates of infection, as people are not in as close contact.
With what we’ve learned from these unsupervised learning methods, we attempt to create better linear models.
lm.model.3 <- lm(cases ~ E_TOTPOP + percent_rural + overcrowding + percent_severe_housing_problems + EP_AGE65,
data = ny_overall_testing2)
summary(lm.model.3)
##
## Call:
## lm(formula = cases ~ E_TOTPOP + percent_rural + overcrowding +
## percent_severe_housing_problems + EP_AGE65, data = ny_overall_testing2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11397 -886 240 1387 8176
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -9.147e+03 3.981e+03 -2.298 0.0257 *
## E_TOTPOP 2.100e-02 1.930e-03 10.880 6.73e-15 ***
## percent_rural 4.975e+01 2.590e+01 1.921 0.0603 .
## overcrowding 1.527e+03 6.794e+02 2.247 0.0290 *
## percent_severe_housing_problems 1.648e+02 2.199e+02 0.749 0.4571
## EP_AGE65 -2.135e+01 1.914e+02 -0.112 0.9116
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3001 on 51 degrees of freedom
## Multiple R-squared: 0.8419, Adjusted R-squared: 0.8264
## F-statistic: 54.33 on 5 and 51 DF, p-value: < 2.2e-16
Wow, the R-squared value for this model is significantly higher than for our previous models. It’s possible that this is due in part to the smaller data set. Total population, percent rural, and overcrowding are all statistically significant predictors, but percent severe housing problems and percent of people over age 65 are not. We’ll take out those predictors and try a new model, this time including information about how many tests were administered.
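One way to check a decision like this (a sketch we did not run on the county data) is an F-test between nested models via anova(), shown here on simulated data where the extra predictor is noise by construction:

```r
set.seed(1)
x1 <- rnorm(60)
x2 <- rnorm(60)                  # pure noise, unrelated to y
y  <- 2 * x1 + rnorm(60)
small <- lm(y ~ x1)              # reduced model
big   <- lm(y ~ x1 + x2)         # full model with the extra predictor
#anova() tests whether adding x2 significantly reduces the residual sum of squares
anova(small, big)
```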
lm.model.4 <- lm(cases ~ E_TOTPOP + percent_rural + overcrowding + cumulative_number_of_tests_performed,
data = ny_overall_testing2)
summary(lm.model.4)
##
## Call:
## lm(formula = cases ~ E_TOTPOP + percent_rural + overcrowding +
## cumulative_number_of_tests_performed, data = ny_overall_testing2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4111.6 -168.9 97.6 258.6 2670.1
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) -1.303e+03 5.412e+02 -2.407
## E_TOTPOP -8.569e-04 1.137e-03 -0.753
## percent_rural 1.674e+01 6.662e+00 2.512
## overcrowding -3.469e+01 1.570e+02 -0.221
## cumulative_number_of_tests_performed 3.875e-01 1.727e-02 22.440
## Pr(>|t|)
## (Intercept) 0.0197 *
## E_TOTPOP 0.4546
## percent_rural 0.0151 *
## overcrowding 0.8260
## cumulative_number_of_tests_performed <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 914.1 on 52 degrees of freedom
## Multiple R-squared: 0.985, Adjusted R-squared: 0.9839
## F-statistic: 856.1 on 4 and 52 DF, p-value: < 2.2e-16
Now this result is interesting when compared to our previous model. First of all, our adjusted R-squared value is 0.9839. The cumulative number of tests performed is highly statistically significant and explains almost all of the variability in cases by county. Moreover, estimated population is no longer a significant predictor once we control for the number of tests administered. Perhaps estimated total population and cumulative number of tests performed are highly correlated, so including both in the model adds little, with the number of tests being the better predictor.
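That multicollinearity story can be seen in miniature with simulated data (hypothetical variables, not our NY columns): when two predictors are nearly identical, including both leaves their individual coefficients poorly identified.

```r
set.seed(7)
pop   <- rnorm(100)                  # stand-in for total population
tests <- pop + rnorm(100, sd = 0.1)  # "tests" nearly proportional to population
cases <- 3 * tests + rnorm(100)
cor(pop, tests)                      # close to 1
both <- lm(cases ~ pop + tests)
#with two nearly identical predictors in the model, their coefficient
#standard errors inflate relative to a model with either predictor alone
summary(both)$coefficients[, "Std. Error"]
```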
The fact that the number of tests administered is a good predictor is not surprising: in theory, those who get tested have a high likelihood of having covid; otherwise they would not need to be tested. It has been widely reported that the US has not had enough tests available, and one wonders whether areas with lower rates of testing would report higher case counts if they simply had more tests to administer. It would be difficult to learn about this from our data set in its current state.
We try expanding the size of our dataset to see if there is any difference in our models.
ny_overall_testing_daterange <- ny_overall_testing %>%
filter(county.x != 'New York City') %>%
mutate(case_pop_proportion = cases / E_TOTPOP) %>%
mutate(death_pop_proportion = deaths / E_TOTPOP) %>%
mutate(pop_density = E_TOTPOP / AREA_SQMI) %>%
filter(date == '2020-04-26' |
date == '2020-04-25' |
date == '2020-04-24' |
date == '2020-04-23' |
date == '2020-04-22' |
date == '2020-04-21' |
date == '2020-04-20')
lm.model.5 <- lm(cases ~ E_TOTPOP + percent_rural + overcrowding + date,
data = ny_overall_testing_daterange)
summary(lm.model.5)
##
## Call:
## lm(formula = cases ~ E_TOTPOP + percent_rural + overcrowding +
## date, data = ny_overall_testing_daterange)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11378.7 -852.3 239.5 1153.2 9799.3
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.077e+06 1.259e+06 -0.855 0.393
## E_TOTPOP 2.001e-02 6.334e-04 31.591 < 2e-16 ***
## percent_rural 4.271e+01 7.412e+00 5.762 1.68e-08 ***
## overcrowding 1.763e+03 1.478e+02 11.933 < 2e-16 ***
## date 5.821e+01 6.850e+01 0.850 0.396
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2737 on 394 degrees of freedom
## Multiple R-squared: 0.8337, Adjusted R-squared: 0.832
## F-statistic: 493.8 on 4 and 394 DF, p-value: < 2.2e-16
We see that total population, percent rural, and overcrowding are all very significant predictors, and our R-squared value of .832 is quite high.
We’d now like to see how well we can model the infection rate curves for different counties across New York.
mostcase_counties <- ny_overall_testing %>%
filter(county.x != 'New York City') %>%
mutate(first_case = ifelse(cases>0, 1, 0)) %>%
filter(first_case == 1) %>%
filter(date == '2020-04-27') %>%
arrange(-cases) %>%
head(10) %>%
select(county.x)
mostcase_counties
## # A tibble: 10 x 1
## county.x
## <chr>
## 1 Nassau
## 2 Suffolk
## 3 Westchester
## 4 Rockland
## 5 Orange
## 6 Erie
## 7 Dutchess
## 8 Monroe
## 9 Ulster
## 10 Albany
#create plot showing case rate curves by county
ny_overall_testing %>%
mutate(first_case = ifelse(cases>0, 1, 0)) %>%
filter(first_case == 1) %>%
filter(county.x %in% mostcase_counties$county.x) %>%
ggplot(aes(x = date,
y = cases)) +
geom_point(aes(color = factor(county.x))) +
ggtitle('Infection Rates in NY Counties') +
ylab('Number of Cases') +
xlab('Date')
We see that the top 5 counties have significantly more cases than the others, so we’ll focus on these to keep the graphical analysis clear. First we’ll try kernel smoothing to approximate these curves.
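Before applying it to the county curves, here is a standalone sketch of what ksmooth() does on toy data: each fitted value is a normal-kernel-weighted local average of nearby observations.

```r
set.seed(3)
x <- seq(0, 10, length.out = 50)
y <- sin(x) + rnorm(50, sd = 0.3)    # noisy signal
sm <- ksmooth(x, y, kernel = 'normal', bandwidth = 2, x.points = x)
#sm$y holds the smoothed values evaluated at the original x locations
length(sm$y)  # 50
```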
mostcase_counties2 <- as.data.frame(ny_overall_testing) %>%
filter(county.x != 'New York City') %>%
mutate(first_case = ifelse(cases>0, 1, 0)) %>%
filter(first_case == 1) %>%
filter(date == '2020-04-27') %>%
arrange(-cases) %>%
head(5) %>%
select(county.x)
ny_top5 <- ny_overall_testing %>%
filter(county.x != 'New York City') %>%
mutate(first_case = ifelse(cases>0, 1, 0)) %>%
filter(first_case == 1) %>%
filter(county.x %in% mostcase_counties2$county.x)
#drop the initial rows recorded before all five counties had reported cases
ny_top5 <- ny_top5[-c(1:25),]
#day index: each of the 47 dates appears once per county
date_nums <- rep(1:47, each = 5)
#add date number columns, represents number of days after March 12
ny_top5_datenums <- cbind(ny_top5, date_nums)
top_counties <- list()
for(i in 1:5){
top_counties[[i]] <- ny_top5_datenums %>%
filter(county.x == mostcase_counties2$county.x[i])
}
ksmooths <- list()
#create a kernel smoothing model for each county's case curve
for(i in 1:5) {
  currentksmooth <- ksmooth(x = top_counties[[i]]$date_nums,
                            y = top_counties[[i]]$cases,
                            kernel = 'normal',
                            bandwidth = 5,
                            x.points = top_counties[[i]]$date_nums)
  ksmooths[[i]] <- currentksmooth
}
top5_caseplot <- ny_top5_datenums %>%
mutate(first_case = ifelse(cases>0, 1, 0)) %>%
filter(first_case == 1) %>%
filter(county.x %in% mostcase_counties2$county.x) %>%
ggplot(aes(x = date_nums,
y = cases)) +
geom_point(aes(color = factor(county.x))) +
ggtitle('Infection Rates in NY Counties') +
ylab('Number of Cases') +
xlab('Date')
top5_caseplot +
geom_line(data = data.frame(x = ksmooths[[1]]$x,
y = ksmooths[[1]]$y),
mapping = aes(x = x,
y = y)) +
geom_line(data = data.frame(x = ksmooths[[2]]$x,
y = ksmooths[[2]]$y),
mapping = aes(x = x,
y = y)) +
geom_line(data = data.frame(x = ksmooths[[3]]$x,
y = ksmooths[[3]]$y),
mapping = aes(x = x,
y = y)) +
geom_line(data = data.frame(x = ksmooths[[4]]$x,
y = ksmooths[[4]]$y),
mapping = aes(x = x,
y = y)) +
geom_line(data = data.frame(x = ksmooths[[5]]$x,
y = ksmooths[[5]]$y),
mapping = aes(x = x,
y = y)) +
ggtitle('Top-5 NY County Case Curves Overlaid with Kernel Smoothers')
We see that kernel smoothing does a very good job of approximating the infection rate curves for these counties. We can quantify the error by calculating the root mean squared error for the kernel smoothing predictions compared to the actual case numbers.
smooth_rmses <- list()
for(i in 1:5) {
  #root mean squared error between the smoothed and actual case counts
  smooth_rmses[[i]] <- sqrt(mean((ksmooths[[i]]$y - top_counties[[i]]$cases)^2))
}
print(smooth_rmses)
## [[1]]
## [1] 1489.278
##
## [[2]]
## [1] 1589.873
##
## [[3]]
## [1] 1274.709
##
## [[4]]
## [1] 819.4532
##
## [[5]]
## [1] 693.8694
Are these root mean squared error values good or bad? We can see how they change if we use a different bandwidth for our smoothing function. Because all the counties share the same date numbers, we can use a single county to gauge the effect of changing the bandwidth.
#Can we reduce rmse by changing bandwidth?
tot_rmse <- NULL
for(i in 1:20) {
k <- ksmooth(x = top_counties[[1]]$date_nums,
y = top_counties[[1]]$cases,
kernel = 'normal',
bandwidth = i,
x.points = top_counties[[1]]$date_nums)
tot_rmse[i] <- sqrt(mean(sum((k$y - top_counties[[1]]$cases)^2)))
}
plot(tot_rmse)
As we widen the bandwidth, the total RMSE increases. We want to be careful not to overfit our data, but a bandwidth of around 3 could improve our kernel smoothing models. Next we try approximating the infection curves using splines.
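To make the bandwidth trade-off concrete, here is a small sketch on synthetic data (a hypothetical logistic "case curve" over the same 47-day window stands in for the county data, which isn't reproduced in this chunk). Narrower bandwidths track a steep curve more closely:

```r
# synthetic logistic case curve over 47 days, plus noise (illustrative only)
set.seed(1)
x <- 1:47
y <- 5000 / (1 + exp(-0.25 * (x - 25))) + rnorm(47, sd = 100)

# true RMSE of a kernel smooth at a given bandwidth
rmse_for <- function(bw) {
  k <- ksmooth(x, y, kernel = 'normal', bandwidth = bw, x.points = x)
  sqrt(mean((k$y - y)^2))
}
sapply(c(3, 5, 10), rmse_for)  # compare fit at bandwidths 3, 5, and 10
```

On a curve this steep, the bias from oversmoothing at bandwidth 10 swamps the noise reduction, which is the same pattern the plot above shows.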
library(splines)
splines <- list()
spline_preds <- list()
spline_rmses <- list()
for(i in 1:5) {
#creating spline models; fitting within each county's data frame
splines[[i]] <- lm(cases ~ bs(date_nums, knots = c(5, 25, 45)),
data = top_counties[[i]])
#predict() without newdata returns the fitted values at the training dates
spline_preds[[i]] <- predict(splines[[i]])
spline_rmses[[i]] <- sqrt(mean(sum((spline_preds[[i]] - top_counties[[i]]$cases)^2)))
}
print(spline_rmses)
## [[1]]
## [1] 2431.912
##
## [[2]]
## [1] 2738.438
##
## [[3]]
## [1] 2314.419
##
## [[4]]
## [1] 1348.303
##
## [[5]]
## [1] 1155.995
top5_caseplot +
geom_line(data = data.frame(x = top_counties[[1]]$date_nums,
y = spline_preds[[1]]),
mapping = aes(x = x,
y = y)) +
geom_line(data = data.frame(x = top_counties[[2]]$date_nums,
y = spline_preds[[2]]),
mapping = aes(x = x,
y = y)) +
geom_line(data = data.frame(x = top_counties[[3]]$date_nums,
y = spline_preds[[3]]),
mapping = aes(x = x,
y = y)) +
geom_line(data = data.frame(x = top_counties[[4]]$date_nums,
y = spline_preds[[4]]),
mapping = aes(x = x,
y = y)) +
geom_line(data = data.frame(x = top_counties[[5]]$date_nums,
y = spline_preds[[5]]),
mapping = aes(x = x,
y = y)) +
ggtitle('NY County Cases Overlaid with Splines')
The splines are also doing a great job of approximating the infection curves. Comparing RMSEs, it appears that kernel smoothing slightly outperformed the splines. However, we only tried one set of knots. This time we'll spread knots across the whole date number range.
#knots every 3 days across the date range
knot_vals <- 3 * (1:15)
library(splines)
splines2 <- list()
spline_preds2 <- list()
spline_rmses2 <- list()
for(i in 1:5) {
#creating spline models with knots spread across the date range
splines2[[i]] <- lm(cases ~ bs(date_nums, knots = knot_vals),
data = top_counties[[i]])
spline_preds2[[i]] <- predict(splines2[[i]])
spline_rmses2[[i]] <- sqrt(mean(sum((spline_preds2[[i]] - top_counties[[i]]$cases)^2)))
}
print(spline_rmses2)
## [[1]]
## [1] 532.8363
##
## [[2]]
## [1] 918.7758
##
## [[3]]
## [1] 462.5979
##
## [[4]]
## [1] 441.0396
##
## [[5]]
## [1] 364.9942
top5_caseplot +
geom_line(data = data.frame(x = top_counties[[1]]$date_nums,
y = spline_preds2[[1]]),
mapping = aes(x = x,
y = y)) +
geom_line(data = data.frame(x = top_counties[[2]]$date_nums,
y = spline_preds2[[2]]),
mapping = aes(x = x,
y = y)) +
geom_line(data = data.frame(x = top_counties[[3]]$date_nums,
y = spline_preds2[[3]]),
mapping = aes(x = x,
y = y)) +
geom_line(data = data.frame(x = top_counties[[4]]$date_nums,
y = spline_preds2[[4]]),
mapping = aes(x = x,
y = y)) +
geom_line(data = data.frame(x = top_counties[[5]]$date_nums,
y = spline_preds2[[5]]),
mapping = aes(x = x,
y = y)) +
ggtitle('NY County Cases Overlaid with Splines, More Knots')
Oh wow, those splines look quite good! In fact, by adding knots, our splines now outperform the kernel smoothing by the metric of root mean squared error. Finally, we try a smoothing spline.
smooth_splines <- list()
smooth_spline_rmses <- list()
for(i in 1:5) {
#creating smoothing spline models (smoothing level chosen automatically by GCV)
smooth_splines[[i]] <- smooth.spline(top_counties[[i]]$date_nums,
top_counties[[i]]$cases)
smooth_spline_rmses[[i]] <- sqrt(mean(sum((smooth_splines[[i]]$y - top_counties[[i]]$cases)^2)))
}
print(smooth_spline_rmses)
## [[1]]
## [1] 1.486028e-08
##
## [[2]]
## [1] 774.3128
##
## [[3]]
## [1] 3.831094e-09
##
## [[4]]
## [1] 255.0054
##
## [[5]]
## [1] 2.148871e-10
top5_caseplot +
geom_line(data = data.frame(date_nums = smooth_splines[[1]]$x,
cases = smooth_splines[[1]]$y)) +
geom_line(data = data.frame(date_nums = smooth_splines[[2]]$x,
cases = smooth_splines[[2]]$y)) +
geom_line(data = data.frame(date_nums = smooth_splines[[3]]$x,
cases = smooth_splines[[3]]$y)) +
geom_line(data = data.frame(date_nums = smooth_splines[[4]]$x,
cases = smooth_splines[[4]]$y)) +
geom_line(data = data.frame(date_nums = smooth_splines[[5]]$x,
cases = smooth_splines[[5]]$y)) +
ggtitle('NY County Cases Overlaid with Smoothing Splines')
Oh wow do those smoothing splines look good! Looking at the RMSEs, smoothing splines outperform both basic splines and kernel smoothing - in fact, the near-zero errors suggest we may have overfitted the data. When working with any of the techniques above, we must remember that they are effective at interpolating our data, but not at extrapolating beyond it. Our RMSE values are very low for all three models, but the predictive power comes from the date numbers that we provide - trying to predict outside this range would not produce realistic results.
While kernel smoothing and splines make for compelling visualizations of the need for “flattening the curve,” they do not provide a way to extrapolate, so we don't see much other value in using these techniques in this context.
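To make the extrapolation caveat concrete, here is a small sketch (again on synthetic data, since the county curves aren't reproduced in this chunk) showing that a B-spline fit predicts sensibly inside its fitted range but should not be trusted beyond the boundary knots:

```r
library(splines)
# synthetic logistic case curve (illustrative only)
set.seed(1)
x <- 1:47
y <- 5000 / (1 + exp(-0.25 * (x - 25))) + rnorm(47, sd = 100)

fit <- lm(y ~ bs(x, knots = c(10, 25, 40)))
# inside the fitted range: a reasonable prediction
predict(fit, newdata = data.frame(x = 30))
# beyond day 47: the basis extrapolates an unanchored cubic, and R warns
# that the new x lies outside the boundary knots
predict(fit, newdata = data.frame(x = 80))
```

The out-of-range prediction is just the polynomial tail of the last spline segment, with no data behind it - exactly why these fits interpolate well but extrapolate poorly.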
Now we will explore how bloodwork factors into COVID-19 testing results and ICU admittance with LDA and KNN.
We will be using the clinical spectrum data set for K Nearest Neighbors (KNN) and Linear Discriminant Analysis (LDA) supervised learning. First, we will clean the data. This involves removing all of the NAs and only keeping relevant columns.
# Remove all NAs and only keep relevant columns
clinical_spectrum_filtered <- clinical_spectrum[,c(1:39)]
clinical_spectrum_filtered <- clinical_spectrum_filtered[,-c(21, 28)]
clinical_spectrum_clean <- clinical_spectrum_filtered %>%
na.omit()
# Now, convert test result to colors so we can plot.
# Red = positive COVID-19 test, Blue = negative
clinical_spectrum_cleanPlot <- clinical_spectrum_clean %>%
mutate(test_result = ifelse(sars_cov_2_exam_result == 'negative',
'Blue',
'Red'))
# Let's take a look at how platelet levels and hematocrit affect
# test outcome
plotData <- as.data.frame(
clinical_spectrum_cleanPlot[, c("test_result", "platelets", "hematocrit")])
plot( plotData[ , c(2, 3) ],
col=plotData[ ,1 ],
main = "COVID-19 Test Results by Platelets and Hematocrit")
# How about leukocytes and age?
plotData <- as.data.frame(
clinical_spectrum_cleanPlot[, c("test_result", "leukocytes", "patient_age_quantile")])
plot( plotData[ , c(2, 3) ],
col=plotData[ ,1 ],
main = "COVID-19 Test Results by Leukocytes and Age")
These charts show that, in general, leukocyte levels, age, platelet count, and hematocrit are all correlated with COVID-19 test results. We tried several different variables; these four showed the most promising relationships. As platelet levels, hematocrit, and age increase, the likelihood of a positive test increases, whereas as leukocyte levels increase, the likelihood of a positive test decreases.
But how do these same four factors affect the likelihood of being admitted to the ICU?
# Now, convert icu result to colors so we can plot.
# Red = admitted, Blue = not
clinical_spectrum_cleanPlot <- clinical_spectrum_clean %>%
mutate(icu_result = ifelse(patient_addmited_to_intensive_care_unit_1_yes_0_no == 'FALSE',
'Blue',
'Red'))
# Let's check out the same four variables for ICU admittance
plotData <- as.data.frame(
clinical_spectrum_cleanPlot[, c("icu_result", "platelets", "hematocrit")])
plot( plotData[ , c(2, 3) ],
col=plotData[ ,1 ],
main = "ICU Admittance by Platelets and Hematocrit")
plotData <- as.data.frame(
clinical_spectrum_cleanPlot[, c("icu_result", "leukocytes", "patient_age_quantile")])
plot( plotData[ , c(2, 3) ],
col=plotData[ ,1 ],
main = "ICU Admittance by Leukocytes and Age")
Unlike the test results, these two charts indicate that the four factors have little to do with whether or not a patient was admitted to the ICU, as the red and blue dots are relatively evenly mixed.
Now we will start implementing LDA.
library(MASS)
library(fmsb)
# Functions for calculating accuracy, precision, and recall from a
# confusion matrix with rows = predicted and columns = actual
accuracy <- function(x){sum(diag(x)) / sum(x) * 100}
# in a 2x2 table, x[4] = true positives, x[2] = false positives, x[3] = false negatives
precision <- function(x){x[4]/(x[4] + x[2])}
recall <- function(x){x[4]/(x[4] + x[3])}
# Let's clean up the test results (A = negative, B = positive)
clinical_spectrum_clean <- clinical_spectrum_clean %>%
mutate(test_result = ifelse(sars_cov_2_exam_result == 'negative',
'A',
'B'))
# Predicting covid test results from blood
clinical_spectrum_blood <- clinical_spectrum_clean[,c(7:20, 38)]
lda1 <- lda(test_result ~ .,
data = clinical_spectrum_blood,
CV = FALSE)
lda1Predict <- predict(lda1,
newdata=clinical_spectrum_blood
)
lda1tab <- table(lda1Predict$class, clinical_spectrum_blood$test_result)
# Let's calculate accuracy and kappa
lda1tab
##
## A B
## A 308 44
## B 4 6
cat("Accuracy: ", accuracy(lda1tab), "\n")
## Accuracy: 86.74033
cat("Precision: ", precision(lda1tab), "\n")
## Precision: 0.6
cat("Recall: ", recall(lda1tab), "\n")
## Recall: 0.12
lda1kappa <- Kappa.test(lda1tab)
cat("Kappa: ", as.numeric(lda1kappa$Result[2]), " ", lda1kappa$Judgement)
## Kappa: 0.16139 Slight agreement
Linear Discriminant Analysis on this data set with all blood variables yields 86.74% accuracy for predicting whether or not a patient has COVID-19. However, Kappa may be a better indication of the algorithm's performance: if LDA predicted no COVID 100% of the time, the accuracy would still be high because most patients test negative. With a kappa value of 0.16, we see that LDA does not do a great job of predicting COVID. Kappa is a better metric than accuracy because it accounts for the likelihood of a correct classification by chance. Precision and recall are also important for taking false positives and false negatives into account. Recall is especially telling here, with a low value of 0.12: the algorithm catches only 6 of the 50 true positives. This is very dangerous in a medical setting, where a false negative can lead to a patient not getting treatment. In this instance, our algorithm is overpredicting no COVID.
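To make Cohen's kappa concrete, we can reproduce the value above by hand from the confusion matrix, using the standard kappa formula (this is a sketch of the calculation, not the internals of fmsb::Kappa.test):

```r
# confusion matrix from above: rows = predicted, columns = actual
tab <- matrix(c(308, 4, 44, 6), nrow = 2,
              dimnames = list(pred = c('A', 'B'), actual = c('A', 'B')))
n  <- sum(tab)
po <- sum(diag(tab)) / n                       # observed agreement (accuracy)
pe <- sum(rowSums(tab) * colSums(tab)) / n^2   # agreement expected by chance
(po - pe) / (1 - pe)                           # kappa, approximately 0.161
```

Because 86% of patients test negative, a classifier could reach high agreement by chance alone (pe is about 0.84 here), and kappa rescales the observed agreement against that baseline.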
What about ICU rates?
# Predicting ICU test results from blood
clinical_spectrum_icu <- clinical_spectrum_clean[,c(6:20)]
ldaICU <- lda(patient_addmited_to_intensive_care_unit_1_yes_0_no ~ .,
data = clinical_spectrum_icu,
CV = FALSE)
ldaICUPredict <- predict(ldaICU,
newdata=clinical_spectrum_icu
)
ldaICUtab <- table(ldaICUPredict$class, clinical_spectrum_icu$patient_addmited_to_intensive_care_unit_1_yes_0_no)
ldaICUtab
##
## FALSE TRUE
## FALSE 334 16
## TRUE 6 6
cat("Accuracy: ", accuracy(ldaICUtab), "\n")
## Accuracy: 93.92265
cat("Precision: ", precision(ldaICUtab), "\n")
## Precision: 0.5
cat("Recall: ", recall(ldaICUtab), "\n")
## Recall: 0.2727273
lda1kappa <- Kappa.test(ldaICUtab)
cat("Kappa: ", as.numeric(lda1kappa$Result[2]), " ", lda1kappa$Judgement)
## Kappa: 0.3239389 Fair agreement
Linear Discriminant Analysis for predicting ICU admittance with all blood variables yields a 93.92% accuracy. In addition to higher accuracy than for the COVID test results, this LDA has a higher kappa value of 0.32 and a higher recall of 0.27. These results indicate that a patient's bloodwork is a better indicator of whether they will be admitted to the ICU than of whether or not they have COVID.
Interesting. Let’s plot to see what is going on.
# Now, we will plot
ldaICUPredictions1 <- ldaICUPredict$posterior[,2]
finalICUData1 <- data.frame(clinical_spectrum_icu, ldaICUPredictions1)
finalICUData1 %>%
ggplot(aes(x = leukocytes,
y = monocytes)) +
geom_point(aes(color = ldaICUPredictions1)) +
ggtitle("ICU Predictions LDA") +
xlab("Leukocytes") +
ylab("Monocytes") +
labs(color = "Predicted ICU Prob")
ldaTestResultPredictions <- lda1Predict$posterior[,2]
finalICUData2 <- data.frame(clinical_spectrum_icu, ldaTestResultPredictions)
finalICUData2 %>%
ggplot(aes(x = leukocytes,
y = monocytes)) +
geom_point(aes(color = ldaTestResultPredictions)) +
ggtitle("Covid Test Result Predictions LDA") +
xlab("Leukocytes") +
ylab("Monocytes") +
labs(color = "Predicted Positive Prob")
These graphs show interesting results. The predicted probability of a positive COVID-19 test is higher when monocytes are higher and leukocytes are lower (top-left), whereas the predicted ICU probability is higher in the opposite conditions, when leukocyte levels are higher and monocytes are lower (bottom-right).
library(class) # For KNN
set.seed(2)
# We will split the data into 75% training, 25% testing
train_ind <- sample(seq_len(nrow(clinical_spectrum_clean[,c(6:20)])),
size = floor(0.75 * nrow(clinical_spectrum_clean[,c(6:20)])))
trainICU <- clinical_spectrum_clean[,c(6:20)][train_ind, ]
testICU <- clinical_spectrum_clean[,c(6:20)][-train_ind, ]
# We will predict for ICU admittance first
# (note: the ICU label column is itself among columns 6:20, so it leaks into
# the features here; dropping it would give a stricter test)
pred <- knn(train = trainICU,
test = testICU,
cl = trainICU$patient_addmited_to_intensive_care_unit_1_yes_0_no,
k = 1)
tab <- table(pred, testICU$patient_addmited_to_intensive_care_unit_1_yes_0_no)
tab
##
## pred FALSE TRUE
## FALSE 83 5
## TRUE 2 1
cat("Accuracy: ", accuracy(tab), "\n")
## Accuracy: 92.30769
cat("Precision: ", precision(tab), "\n")
## Precision: 0.3333333
cat("Recall: ", recall(tab), "\n")
## Recall: 0.1666667
tabKappa <- Kappa.test(tab)
cat("Kappa: ", as.numeric(tabKappa$Result[2]), " ", tabKappa$Judgement)
## Kappa: 0.1864623 Slight agreement
Our KNN implementation yields a 92.31% accuracy on the test data when predicting for ICU admittance, compared to 93.92% using LDA. The Kappa for LDA was 0.32, KNN’s is significantly lower.
Let’s try KNN to predict COVID-19 test results.
# KNN for test results
# (note: negative tests are coded TRUE here, so the metrics below treat a
# negative result as the positive class)
clinical_spectrum_clean <- clinical_spectrum_clean %>%
mutate(test_result = ifelse(sars_cov_2_exam_result == 'negative',
TRUE,
FALSE))
train_ind1 <- sample(seq_len(nrow(clinical_spectrum_clean[,c(7:20, 38)])),
size = floor(0.75 * nrow(clinical_spectrum_clean[,c(7:20, 38)])))
trainCovid <- clinical_spectrum_clean[,c(7:20, 38)][train_ind1, ]
testCovid <- clinical_spectrum_clean[,c(7:20, 38)][-train_ind1, ]
pred <- knn(train = trainCovid,
test = testCovid,
cl = trainCovid$test_result,
k = 1)
tab1 <- table(pred, testCovid$test_result)
tab1
##
## pred FALSE TRUE
## FALSE 10 7
## TRUE 5 69
cat("Accuracy: ", accuracy(tab1), "\n")
## Accuracy: 86.81319
cat("Precision: ", precision(tab1), "\n")
## Precision: 0.9324324
cat("Recall: ", recall(tab1), "\n")
## Recall: 0.9078947
tab1Kappa <- Kappa.test(tab1)
cat("Kappa: ", as.numeric(tab1Kappa$Result[2]), " ", tab1Kappa$Judgement)
## Kappa: 0.5453789 Moderate agreement
Our KNN implementation yields an 86.81% accuracy on the test data when predicting COVID-19 test results, compared to 86.74% using LDA. Furthermore, KNN's Kappa for COVID-19 prediction is 0.55 compared to 0.16 with LDA. Precision and recall are also far higher for KNN (both around 90%), though note that because we coded a negative test as TRUE here, these values are computed for the negative class and are not directly comparable to the LDA values, which treated a positive test as the positive class. Still, the Kappa comparison shows that although the two algorithms are comparable in accuracy, KNN clearly outperforms LDA on this task. While LDA separates classes with linear boundaries, KNN classifies each point by its nearest neighbors, which suggests the classes in this data separate locally rather than linearly.
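A small sketch makes the geometric point, using synthetic data we construct here (not the clinical data set): a class defined by a ring is impossible for a single linear boundary but easy for nearest neighbors.

```r
library(MASS)   # lda()
library(class)  # knn()
set.seed(2)
# two classes separated by a circle rather than a line (illustrative only)
n <- 400
d <- data.frame(x1 = runif(n, -1, 1), x2 = runif(n, -1, 1))
d$cls <- factor(d$x1^2 + d$x2^2 > 0.5)
idx <- sample(n, 300)
# held-out accuracy for each method
lda_acc <- mean(predict(lda(cls ~ x1 + x2, data = d[idx, ]),
                        d[-idx, ])$class == d$cls[-idx])
knn_acc <- mean(knn(d[idx, 1:2], d[-idx, 1:2], d$cls[idx], k = 5)
                == d$cls[-idx])
c(lda = lda_acc, knn = knn_acc)  # KNN handles the non-linear boundary far better
```

LDA can do no better than a straight line through the ring, while KNN's local neighborhoods follow the curved boundary - the same kind of advantage it appears to have on the blood data.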